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|  1.  INTRODUCTION 

i 

i 

1.1  Overview  of  the  Adaptive  Time  Series  Analysis  Problem 

Adaptation  in  time  series  is  an  important  problem  in  a  number  of  DOD 
systems  and  has  many  applications  in  various  commercial  industries.  This 
|  is  an  especially  difficult  problem  in  problems  requiring  realtime  adap- 

;  tation  to  process  changes  since  such  a  procedure  would  have  to  be  comple¬ 

tely  automatic  and  reliable.  Adaptation  is  necessary  in  systems  where  the 
dynamical  characteristics  change  with  time  in  unpredictable  ways,  or  where 
the  noise  disturbance  process  characteristics  vary  with  time.  Examples  of 


systems  that  require  adaptive  time  series  analysis  are  the  adaptive 
suppression  of  aircraft  wing  flutter,  identification  of  the  dynamics  of 
large  flexible  space  structures,  detection  of  failures  in  aircraft  from 
subsystem  failures  of  battle  damage,  identification  of  missile  aerodyna¬ 
mics,  target  tracking,  and  various  signal  processing  problems. 

The  solution  to  the  adaptive  time  series  analysis  requires  several 
advances  in  current  time  series  methods.  At  the  core  of  the  problem  is  the 
need  for  a  fundamental  statistical  approach  to  the  adaptation  problem  that 
poses  the  problem  in  a  meaningful  way  and  that  leads  to  computable  solu¬ 
tions.  To  solve  the  online  adaptation  problem,  a  reliable  and  automatic 
time  series  modeling  procedure  is  required  that  Is  lacking  in  previous 
methods.  The  current  research  provides 

•  A  sound  statistical  basis  for  posing  and  solving  the  adaptation 
problem 


-1- 


WWW 


- .  * v 
•  V  v 


•  A  numerically  and  statistically  reliable  online  computational 


procedure 

This  approach  has  been  used  in  conjunction  with  a  new  high  resolution 
system  identification  method  utilizing  canonical  variate  analysis  (CVA)  for 
the  determination  of  the  dynamics  of  high  order  multisensor  systems  with  a 
small  data  length  (Larimore,  1983b).  This  algorithm  can  be  implemented  on 
highly  parallel  processors  such  as  a  systolic  array.  This  makes  practical 
the  consideration  of  many  different  system  characteristics  to  determine  the 
best  for  modeling  the  observed  sensor  data  and  correlational  relationships 
between  the  many  sensors.  The  system  characteristics  that  have  been  suc¬ 
cessfully  determined  adaptively  are  the  dynamical  state  order  of  the 
system,  the  presence  of  correlated  disturbances,  the  optimal  data  length  to 
use  in  tracking  a  time  varying  system,  and  the  optimal  data  interval  for 
detection  of  an  abrupt  change  or  other  event  in  the  data. 

The  CVA  time  series  analysis  method  has  been  applied  to  the  design  of 
an  adaptive  flutter  suppression  problem  for  suppressing  wing  flutter  or 
aero-structural  vibration  in  aircraft.  While  considerable  progress  has 
been  made  in  the  problem  of  adaptation  in  terms  of  identification  of  time 
series  models,  adaptive  time  series  methods  which  can  efficiently  track  and 
detect  time  varying  processes  would  further  improve  the  svstem.  In  such  a 
system  the  wing  dynamical  characteristics  can  change  instantaneously  when  a 
wing  store  is  dropped,  and  the  new  wing  dynamics  are  unknown  and  may  be 
unstable  resulting  in  a  growing  oscillation.  If  the  unstable  mode  is  not 


detected,  accurately  identified,  and  stabilized  by  control  feedback  in  less 


than  a  second,  then  the  aircraft  can  lose  a  wing.  The  CVA  algorithm  using 
entropy  methods  for  deciding  model  state  order  are  being  implemented  on  a 
vector  array  processor  which  will  identify  high  order  systems  with  dozens 
of  dynamical  states  and  multiple  inputs  and  outputs  in  fractions  of  a 
second.  This  system  has  been  tested  in  real-time  simulations,  and  was  suc¬ 
cessfully  demonstrated  in  wind  tunnel  tests  at  the  NASA  Langley  Transonic 
Dynamics  Wind  Tunnel.  It  is  expected  that  highly  parallel  processors  such 
as  systolic  array  processors  could  result  in  a  speedup  of  many  thousands  of 
times  which  would  be  required  for  some  very  large  scale  real  time  adaptive 
problems. 


1.2  Signal  and  Fault  Detection 


A  Comprehensive  survey  of  fault  detection  methods  is  given  by  Willsky 
(1976).  See  also  Mehra  and  Peschon  (1971),  Willsky  and  Jones  (1974), 
Willsky  (1980),  and  Isermann  (1984).  The  type  of  abrupt  changes  in  a 
system  that  are  considered  are  of  the  form 


x(t+l)  *  <&x(t)  +  Gu(  t )  +  w(t)  +  m(t) 


(1.1) 


y(t)  *  Hx(t)  +  Au(t)  +  Bw(t)  +  v(t)  +  N(t) 


(1.2) 


where  u  is  the  input  vector  process,  y  is  the  output  vector,  x  is  the  state 
vector,  and  w  and  v  are  white  noise  processes  that  are  independent  with 
covariance  matrices  Q  and  R  respectively.  These  white  noise  processes 
model  the  covariance  structure  of  the  error  in  predicting  v  from  u.  The 
abrupt  changes  are  in  the  form  of  the  time  the  functions  m(t)  and  n(t) 
introduced  into  the  state  and  observation  equations.  Fault  detection  is 
thus  the  detection  of  the  presence  of  such  nonzero  functions. 


For  various  hypothesized  forms  of  the  functions,  i.e.  ,  for  jumps  in 
various  components  or  specific  combinations  of  the  components,  a  particular 
detection  computation  is  devised  which  requires  implementation  of  a  Kalman 
filter.  This  leads  to  statistically  most  powerful  likelihood  ratio  tests 
of  the  various  failure  hypotheses.  An  optimal  solution  to  the  failure 
detection  problem  formulated  in  (1.1)  and  (1.2)  is  thus  obtained. 

There  are  however  several  more  general  failure  detection  problems  not 
of  the  form  of  (1.1)  and  (1.2).  The  approach  permits  only  the  consideration 
simple  hypotheses,  i.e.,  where  the  failure  functions  m(t)  and  n(t)  are  of 
the  form  of  an  unknown  scalar  amplitude  parameter  multiplying  a  function  of 
known  form.  More  general  functional  forms  such  as  two  components  with  dif¬ 
ferent  unknown  amplitude  parameters  multiplying  the  known  functions 
requires  maximum  likelihood  parameter  identification  at  considerable  com¬ 
putational  expense  and  loss  of  numerical  reliability.  Furthermore,  the 
problem  of  unknown  failure  time  leads  to  a  considerable  increase  in  the 
required  computation,  and  no  theoretically  sound  decision  procedure  has 
been  proposed  for  choosing  the  failure  time. 

The  general  case  of  changes  in  the  system  dynamics  or  correlational 
characteristics  of  the  disturbance  or  measurement  noise  processes  cannot  be 
handled.  Such  cases  require  general  time  series  analysis  parameter  iden¬ 
tification  methods  which  are  not  reliable  for  online  application  to  high 
state  order  multivariable  systems  as  discussed  in  Section  Multisensor 
System  Identification.  Isermann  (1984)  gives  a  survey  of  current  fault 
detection  methods  and  concludes  that:  "A  unique  calculation  of  the  process 


coefficients  and  a  parameter  estimation  with  high  precision  is  only 
possible  for  low  order  elements  between  measured  variables.  Therefore  the 
measured  variables  should  be  selected  such  that  the  process  is  divided  in 
first  order  elements  or,  in  other  words,  all  state  variables  should  be 
measurable.  Easy  to  implement  parameter  estimation  methods  for  continuous¬ 
time  modles  to  be  used  on-line,  real-time  and  in  closed  loop  need  to  be 
developed."  The  requirement  of  measuring  all  of  the  states  is  not 
realistic  in  most  situations  especially  in  general  multivariate  time  series 
and  system  identification  problems.  Fortunately,  the  CVA  system  iden¬ 
tification  method  does  not  require  this,  but  indeed  is  an  online,  real-time 
method  that  gives  the  same  accuracy  in  either  open  or  closed  loop. 

The  issues  of  adaptation  are  not  addressed  in  the  fault  detection 
literaure  except  in  simplistic  ways.  The  present  state  of  the  art  in  adap¬ 
tation  for  failure  detection  appears  to  be  the  work  of  Hagglund  (1983) 
discussed  in  the  next  section,  and  is  -just  beginning  of  adaptive  approaches 
which  consider  fundamental  issues  in  adaptation. 


Concepts  of  adaptive  systems  have  been  around  since  the  1950' s 
involving  various  senses  of  adaptation.  The  present  literature  on  the  sub¬ 
ject  includes  a  number  of  methods  such  as  recursive  computational  schemes, 
exponential  forgetting,  lattice  computational  methods,  etc.,  which  have 
certain  "knobs"  that  allow  tuning  of  the  algorithm  to  accommodate  changes 
in  the  characteristics  of  the  actual  processes.  Reviews  of  these  and 
related  methods  are  contained  in  several  recent  special  issues  of  technical 


journals  and  books  (Special  Issue  on  Adaptive  Control,  Automatics,  Vol.  20, 
No.  5,  1985;  Special  Issue  on  Linear  Adaptive  Filtering,  IEEE  Trans,  on 
Information  Theory,  Vol.  30,  No  2,  1984;  Honing  and  Messerschmitt ,  1984). 
While  these  methods  do  permit  some  degree  of  adaptation  to  process  changes, 
the  methods  of  adaptation  are  ad  hoc,  and  no  sound  underlying  statistical 
principle  for  adaptation  is  proposed  or  demonstrated.  As  might  be 
expected,  these  methods  can  work  poorly  on  certain  cases  because  of  the 
lack  of  a  sound  statistical  basis. 

In  particular,  the  recursive  prediction  error  and  lattice  methods  are 
convenient  due  to  their  recursive  form  and  provide  an  estimate  at  every 
observation  (Friedlander ,  1982a,  1982b,  1983;  Ljung  and  Soderstrom,  1983). 
Also,  the  recursive  algorithms  can  be  sued  for  adaptation  by  exponential 
weighting  of  the  past  data  (Wellstead  and  Sanoff,  1981;  Irving,  1979;  Evans 
and  Betz,  1982).  But  the  rational  for  exponential  weighting  has  not  been 
given  a  sound  fundamental  justification,  but  is  used  largely  due  to  its 
ease  of  use.  the  choice  of  the  exponential  weight  has  been  ad  hoc  and 
susceptible  to  misinterpretation  of  changing  noise  variance  levels  as  time 
varying  changes  in  the  dynamics  (Hagglund,  1983). 

The  fundamental  problem  in  adaptive  time  series  analysis  is  adaptation 
to  time  varying  processes.  The  essential  problem  is  the  determination  of 
the  characteristics  describing  the  rate  at  which  the  process  is  changing. 
This  problem  has  received  very  little  in-depth  treatment  in  the  literature. 
Most  of  the  difficulty  can  be  attributed  to  the  discrepancy  between  the 
true  and  assumed  uncertainty  in  the  measurements.  Adaptive  control  schemes 


are  notoriously  optimistic  about  the  quality  of  the  parameter  estimates 
because  the  time  varying  nature  of  the  process  is  ignored. 

A  notable  exception  is  the  recent  work  of  Hagglund  (1983)  which  takes 
an  information  handling  point-of-view.  This  approach  leads  to  a  more 
realistic  appraisal  of  the  accuracy  of  the  parameter  estimates  and  con¬ 
sequently  the  value  of  new  measurements  which  become  available  in  time. 

Two  classes  of  time  varying  systems  are  considered: 

•  Processes  with  abrupt  changes 

•  Processes  with  slowly  varying  changes. 

Within  each  of  these  classes,  changes  are  considered  in  the  process  dyna¬ 
mics  and/or  noise  variance. 

For  abrupt  changes,  the  fault  detection  approach  is  taken.  The  central 
idea  is  to  monitor  differential  changes  in  the  parameter  estimates  to 
detect  abrupt  changes.  A  new  procedure  is  derived  by  Hagglund  which 
requires  no  apriori  information  and  is  very  sensitive  to  jumps  in  the  para¬ 
meters.  This  procedure  is  shown  to  have  very  good  properties  in  both 
theory  and  practice.  This  works  well  for  parameters  of  the  dynamics  as 
well  as  those  of  the  noise  variances  in  the  simple  cases  of  low  order 
systems. 

The  problem  of  slowly  varying  parameters  has  plagued  many  adaptive 
control  schemes.  Although  the  concept  of  discounting  the  old  data  using  a 
forgetting  factor  has  been  in  use  for  a  long  time,  the  problem  of  how  to 


relate  this  factor  to  the  data  has  been  elusive.  The  principal  proposed  by 
Hagglund  is  to  discount  past  data  in  such  a  way  that  a  constant  amount  of 
information  would  be  retained  if  the  parameters  were  constant.  The  quan¬ 
titative  measure  of  the  information  used  is  the  inverse  of  the  parameter 
estimation  error  covariance  matrix  which  is  the  Fisher  information  matrix. 
Theory  and  simulations  show  that  this  works  quite  well  in  low  order  and 
well  conditioned  systems.  However  for  high  order  and  multisensor  systems 
with  illconditioned  parametric  structure,  the  algorithms  are  not  so  well 
behaved. 

1.4  Multisensor  System  Identification 

System  parameter  identification  from  observed  measurements  is  a  crucial 
part  of  the  adaptive  multivariate  timeseries  analysis  problem.  It  is 
necessary  to  adapt  anot  only  to  changes  in  the  input  to  output  charac¬ 
teristics  of  a  system,  but  the  correlational  characteristics  of  the  distur¬ 
bance  and  noise  processes  must  simultaneously  be  determined.  The 
feasibility  of  adaptive  methods  requires  first  that  a  reliable  online 
multivariate  time  series  identification  procedure  be  available. 

There  are  several  difficulties  with  currently  available  methods  and 
software  for  the  identification  of  system  dynamics  and  noise  charac¬ 
teristics.  Current  methods  include  the  self  tuning  regulator  (STR)  (Ljung, 
1983;  Astrom,  1973;  Astrom  et  al,  1973,  1977),  maximum  likelihood  estima¬ 
tion  (MLE)  (Mehra  and  Tyler,  1973;  Larimore,  1981a),  Box-Jenkins  (BJ) 
methods  (Box  and  Jenkins,  1976),  and  a  variety  of  heuristic  approaches. 

The  current  state  of  the  art  in  both  MLE  and  BJ  require  that  an  analyst  he 


involved  in  the  procedure,  and  the  required  number  of  computational  itera¬ 
tions  is  not  bounded.  The  STR  has  been  applied  successfully  to  simple  pro¬ 
cesses,  but  is  not  completely  reliable  for  general  processes  particularly 
when  multi-input,  multi-ouput  systems  are  involved.  In  addition,  the 
recursive  prediction  error  algorithm  used  in  the  STR  requires  a  good  ini¬ 
tial  estimate  and  so  is  not  suitable  for  short  data  where  no  apriori  data 
is  available.  The  heuristic  approaches  tend  to  be  special  purposes  and  are 
rather  unreliable  in  general  applications. 

Of  the  current  approaches  to  multivariate  time  series  identification 
which  are  high  resolution,  i.e. ,  make  efficient  use  of  the  observational 
information,  most  use  the  ARMA  (autoregressive  moving  average)  represen¬ 
tation  for  the  process.  For  multi-input  multi-output  systems  this  is  not  a 
globally  well  defined  parameterization  which  is  a  major  cause  of  the  dif¬ 
ficulties  In  the  present  identification  methods  (Gevers  and  Wertz,  1982).  A 
consequence  is  that  there  is  no  single  parameterization  which  is  numeri¬ 
cally  well  conditioned,  and  known  algorithms  can  be  made  to  fail  for  a  par¬ 
ticular  choice  of  system.  The  system  identification  problem  is  well 
defined  in  that  the  class  of  models  does  have  best  models  in  a  maximum 
likelihood  sense  (Larimore,  1981a),  but  the  ARMA  parameterization  is  not 
unique  so  that  for  cases  such  as  pole-zero  cancellation  there  is  a  whole 
equivalence  class  of  models  with  equivalent  characteristics.  In  the  sequel 
this  difficulty  in  parameterization  will  be  resolved  by  the  use  of  state 
space  models,  and  stable  numerical  methods  will  be  described  for  statisti¬ 
cally  reliable  online  identification  of  multivariable  time  series. 
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1.5  Adaptive  Time  Series  Analysis  Using  Predictive  Inference  and  Entropy 

Recently  a  very  general  predictive  inference  approach  to  statistical 
modeling  has  led  to  a  fundamental  statistical  inference  justication  of 
negative  entropy  as  the  natural  measure  of  model  approximation  error 
(Larimore,  1983a).  This  development  has  a  number  of  very  attractive 
features : 

®  It  applies  to  completely  general  modeling  problems  including 
nonparametric  methods. 

•  It  applies  exactly  to  small  samples. 


•  Only  the  fundamental  statistical  principlaes  of  sufficiency  and 
repeated  sampling  are  used. 

•  It  applies  to  time  correlated  problems  such  as  time  series  model 
identification  and  tracking. 

•  Statistical  inference  can  be  fundamentally  viewed  as  model 
approximation. 

Early  developments  in  predictive  distributions  are  very  old,  although 
modern  approaches  apparently  begin  with  Jeffreys  (1961,  p. 143)  who  used  a 
Bayesian  approach,  as  has  much  of  the  work  following  (Atchison  and 
Dunsmore,  1975,  preface  and  p.  39).  The  approach  taken  here  has  been  sti¬ 
mulated  by  Murray  (1977,  1979),  the  work  of  Akaike  (1973)  and  model  struc¬ 
ture  determination  problems  (Larimore,  1977a). 
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1.6  Initial  Results  Indicating  Feasibility 

SSI  has  been  in  the  forefront  in  developing  the  CVA  and  entropy 
methods.  Here  the  related  projects  are  discussed  along  with  preliminary 
results  indicating  the  feasibility  of  the  proposed  methods. 

The  original  stochastic  realization  method  of  Akaike's  (1975)  was 
further  developed  into  a  commercial  software  package  for  mainframe  and  mini 
computers  by  Mehra  (1978)  and  Mehra  and  Cameron  (1976,  1980).  Further 
generalizations  to  input  output  systems  along  with  refinements  in  com¬ 
putational  speed  and  accuracy  were  developed  by  larimore  (1983b)  and 
Goodrich  and  Larimore  (1983)  leading  to  the  current  timeseries  analysis  and 
forecasting  package,  Forecast  Master  (Trademark  of  SSI),  for  the  IBM/PC. 
This  package  is  in  widespread  use  in  utilities,  banks  companies  and 
universities. 

This  algorithm  has  been  the  basis  for  several  studies  in  online  systems 
identification.  The  project  "Basic  Research  in  Adaptive  Model  Algorithmic 
Control"  used  the  online  CVA  system  identification  algorithm.  In  the 
current  study  "Reconfiguration  Control  Strategies",  the  CVA  method  along 
with  adaptive  tracking  and  detection  methods  are  being  studied.  The  pre¬ 
sent  theory  on  adaptation  using  entropy  methods  (Larimore,  1985a)  was  deve¬ 
loped  under  the  basic  research  study  "traget  Dynamic  Modeling"  and  under 
the  study  "Development  of  Statistical  Methods  Using  Predictive  Inference 
and  Entropy”  which  was  Phase  I  of  this  proposed  Phase  II  study. 


A  review  of  the  technology  in  system  identification  and  adaptive 
control  for  adaptive  methods  applicable  to  the  suppression  of  aeroelastic 


\ 
\ 
1 
3 

i 
i 

wing  vibration  (flutter)  was  done  in  Larimore  and  Mehra  (1984).  This  study  ^ 

i 

describes  the  deficiencies  of  current  methods  and  suggets  the  feasibility  ! 

I 

I 

of  CVA  and  entropy  methods  for  fully  adaptive  online  detection  and  tracking 

of  wing  flutter.  In  a  current  study  with  General  Dynamics  sponsored  by  the  J 

Air  Force  Wright  Aeronautical  Laboratories,  CVA  has  been  analyzed  exten-  ! 

( 

I 

sively  in  computer  simulations,  real  time  tests,  and  demonstrated  wind  tun-  , 

| 

nel  teste  for  adaptive  flutter  suppression.  The  ability  of  CVA  to  identify 
very  complex  flutter  dynamics  of  high  state  order  involving  very  closely 

j 

spaced  spectral  peaks  in  the  presence  of  correlated  wind  gust  disturbances 

I 

using  short  data  lenghts  demonstrated  the  consderable  statistical  accuracy 

of  the  method.  The  online  CVA  identification  algorithm  was  demonstrated  1 

i 

ina  wind  tunnel  test  at  the  NASA  Langley  Transonic  Dynamics  Wind  Tunnel  on 
a  1/4  scale  model  of  an  F-16  aircraft. 

1.7  Synopsis  of  Report 

In  Section  2,  we  present  a  detailed  and  transparent  derivation  of  an 
unbiased  entropy  measure  which  will  be  used  in  the  sequel  for  adaptive 
estimation.  This  measure  is  asymptotically  equal  to  Akaike's  AIC  cri¬ 
terion.  In  Section  3,  we  present  a  detailed  description  and  derivation  of 
linear  least-squares  prediction  using  canonical  variates  analysis  (CVA). 

Several  new  forms  for  these  predictors  are  given.  In  Section  4,  a  method 
for  direct  determination  of  the  parameters  of  the  Kalman  filter  in  canoni¬ 
cal  form  is  given,  and  is  shown  to  be  equivalent  to  a  truncated  optimal 
linear  predictor  derived  using  CVA.  Section  5  considers  the  model  order 


selection  problem,  using  an  entropy-based  approach.  The  problem  of  abrupt 


,u  .*C  V|  |  i  >  t.i  4  >  I.I’J  4 


change  detection  using  entropy  methods  is  considered  in  Section  6  and  a 


specific  algorithm  is  derived  and  tested.  In  Section  7  we  consider  the 


problem  of  slow  change  detection,  specifically  the  problem  of  finding  the 


optimal  data  length  for  model  fitting  when  the  time  series  coefficients  are 


slowly  varying.  An  entropy-based  algorithm  is  developed  and  tested. 
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2.  PREDICTIVE  INFERENCE  AND  ENTROPY 


2.1  Introduction 

In  this  section  we  develop  the  necessary  background  for  development  of 
adaptive  estimation  algorithms  in  the  sequel. 

The  problem  under  consideration  is  that  of  predicting  the  future  evolu¬ 
tion  of  a  time  series,  given  some  observations  of  the  past.  The  predictive 
inference  framework  may  be  described  as  follows. 

We  assume  that  the  density  function  of  interest  is  parametrized  by  a 
parameter  vector  0  e  Rm  and  is  denoted  by  p(x  |  0).  For  the  purposes  of 
discrimination  between  two  alternatives  0]  and  Oq  it  can  be  shown  (Akaike, 
1973)  that  all  necessary  information  is  contained  in  the  likelihood  ratio 


p(x  0l) 

P(x  0O) 


(2.1) 


Thus,  the  mean  amount  of  information  for  discrimination  when  p(x  ©o) 


is  the  true  density  is  of  the  form 


where  <)>(.)  is  a  properly  chosen  function.  It  can  be  argued  using  infor¬ 
mation  theoretic  arguments  (Akaike,  1973)  that  the  only  appropriate  form  is 


4>(y)  “  log  y 


(2.3) 


which  leads  directly  to  the  measure 


®(®1 .  ®o)  ■  /  P(*  I  0o)  loR 


1  '  i  *  * 

p(x)Oq 


(2.4) 


Note  that  -  B(©i,  0g)  is  the  Kul lback-Liebler  information  for  discrimi 
nation  in  favor  of  ©g.  It  can  be  easily  shown  that  B(©i,  Oq)  <  0  and 
equality  holds  if  and  only  if  p(x  |  ©[)  =  p(x  |  Oq)  almost  everywhere 
(Aitchison  and  Dunsmore,  1975). 


Note  that  B(0j,  0g)  can  be  written  as 


B(0i,  ©g)  “  J  p(x  |  0o)  log  p(x  j  ©i)  dx 
-  I  p(x  j  ©o)  log  p(x  |  00)  dx 


(2.5) 


Since  ©o  represents  the  true  (unknown)  parameter,  our  objective  is  to  find 

A  A 

the  parameter  estimate  ©  which  maximize  B(©,  Qq).  From  (2.5),  we  need 
only  maximize 


/  p(x  Oq)  log  p(x  ©)  dx 


with  respect  to  ©  to  produce  our  estimate.  This  estimate  maximizes  the 
expected  log-likelihood  and  is  thus  a  maximum  -  likelihood  estimate. 

2.2  Preliminaries 

In  order  to  present  a  clear  development,  we  will  work  in  a  partitioned 
sample  space.  The  random  variable  x  is  presumed  to  be  in  n  -  dimensional 
Euclidean  space,  x  e  Rn,  and  Rn  is  partitioned  into  s  mutually  disjoint 
regions  Qj,  4^2 »  •  •  •  ,  which  cover  Rn: 
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fll  U  «2  u  •  •  •  u  ns  “  R 
flj  n  Oj  -  0  ;  i*j 


We  then  define 


Pi(0)  “  /  p(x  |  0)  dx 

a* 


(2.6) 


1  3  L  ,  2  ,  •  .  .  I  s 


We  consider  two  different  samples,  an  informative  sample  q  and  a  pre¬ 


dictive  sample  r.  The  informative  sample  is 


Xq  *  {xqi»  Xq2  •  •  •  •  »  xqnql 


which  consists  of  nq  observations  of  x.  The  predictive  sample  is 


xr  “  lxrl»  xr2»  •  •  •  >  xrnri 

consists  of  nr  observations  of  x.  We  assume  that  nqj  of  the  informative 
samples  fall  into  and  that  nr^  of  the  predictive  sample  fall  into 


I  nq  i  *  nq 
i*l 


l  nri  *  nr 
i-1 


(2.7) 


The  two  samples  Xq  and  xr  are  from  the  true  distributic 


Thus  we  have,  approximately,  for  sufficiently  large  samples, 


PriW 


(2.9) 


nri 

a*  - 

nr 

and  we  assume  regularity  conditions  throughout  such  that 

Pq(x  |  00)  =  lim  Pqi(Q0) 
nq  00 

s  *  <•> 

where 

lim  =*  x 

s  -*•  ® 

x  £ 

and  similarly  for  pr(x  |  0q).  The  computation  of  the  probabilities  asso¬ 
ciated  with  the  parametrized  densities  is  different.  Here  we  use  the  defi¬ 
nition  (2.6)  and  note  that  p^(0)  is  computable  from  p(x  |  0)  and  knowledge 
of  In  practice,  this  computation  need  not  be  done,  as  become  clear  in 

the  sequel. 

2.3  Entropy  and  Maximum  Likelihood  Estimation 

The  first  step  in  our  development  is  to  form  the  maximum-likelihood 
estimate.  This  is  done  by  maximuzing  (2.5)  on  the  informative 

sample: 

A 

0  «  arg  max  Bq(0,  00) 

0 


where 


Solving  (2.12)  would,  in  principal,  give  the  maximum-likelihood  esti¬ 
mate  if  the  dimension  of  0  were  known.  However,  in  practice,  the  actual 
dimension,  m,  of  9  is  not  known.  Furthermore,  there  is  an  obvious  tradeoff 
between  the  dimension  of  our  estimate  0  and  prediction  error.  Assume 

A 

9  e  Rk.  Then  as  we  increase  k,  the  fit  error  on  the  informative  sample 
will  decrease  momotonical ly.  However,  at  some  point  we  are  in  danger  of 
overfitting  the  model  so  that  0  is  a  function  of  the  sampling  error  on  the 
informative  sample.  When  this  happens,  the  fit  errors  on  the  predictive 
sample  will  begin  to  increase. 

If  we  assume  that  the  true  parameter  vector  dimension  is  m  and  that  the 
estimated  parameter  dimension  is  k  <  m,  then  our  objective  is  to  evaluate 
the  information  measure  on  the  predictive  sample  and  select  the  model  which 


maximizes  this  measure.  The  discrimination  measure  is  now  separated  into 


two  parts  in  order  to  simplify  the  analysis: 

*k  *  Pri(©k> 

Br(0,  0O)  =  J  PriO0>l°g  5T~ 

V  Xl  Pri(0k)  ?  x,  Pri(0O> 

"  L  Pri(Q0)log  - — TT  “  l  Pri(0o)log  - 

i-1  Pri(0k)  i=l  Pri(0k) 

»  Br(©k,  0k)  _  Br(0o,  0k)  (2.13) 

where  0k  e  Rk.  Both  entropy  measures  are  measured  with  respect  to  the  den¬ 
sity  PriCOk)  an(j  ok  £S  arbitrary.  We  will  in  the  sequel  pick  ©k  in  a  par¬ 
ticular  manner  which  clarifies  and  simplifies  the  development.  The 
decomposition  of  (2.13)  is  done  to  clarify  the  expo  ition  and  to  make  clear 
the  crucial  role  played  by  the  number  of  parameters  k.  The  summations  in 
(2.13)  are  taken  with  respect  to  the  true  density  on  the  predictive  sample 
while  0k  is  the  estimate  computed  on  the  informative  sample.  Thus, 

Br(Qk.  0O)  ^  a  measure  of  the  information  between  the  estimated  den¬ 
sity  and  the  true  density  on  the  predictive  sample.  Since  the  informative 
sample  is  known  but  the  predictive  sample  is  not  we  will  use  statistical 
mean  values  in  the  sequel. 

In  order  to  evaluate  Br(0kj  0k)  ancj  Br(©o»  Qk)  we  will  expand  around 
the  actual  probabilities  on  the  informative  sample. 


Evaluation  of  Br(0k>  Qk) 


ii  Vi  Ji  A 


(i>  *,»  f.« 


From  (2.13) 


Br(0k,  0k)  =  l  pri(©o)  (log  Pri(°k)  “  log  Pri(0k)l  (2.14) 

i=l 

Define  the  sampling  error  between  the  informative  and  predictive  proba¬ 
bilities  as 


(2.15) 


et(0)  =  Prl(0)  -  Pqi(C> 

Expanding  the  log  term  to  second  order  yields 


,  3 log  Pqi(0  )  , 

log  pri(0k)  =  log  pql(0k)  +  - g-p---q -  et(0k) 


1  3  logPqi(0k)  3logPqi(9k)  . 

+  - - - 3 -  ei2(Gk)  +  - — 3i -  (Ok  _  0k) 

2  3p  2  30k 

qi 

+  I  (0k  -  0k)T  -a..1.°.^ll(.0  }  (6k  -  0k)  +  (0k  _  ok  )T  1,3gpoln  )  e  (0k) 
Z  lC\kZ  30k  3r  1 


(2.16) 


Br(0k,  0k)  =  I  Pri(0O)  ,.3.log  p.ql(0  }  (5k 


+  T  l  PriW  (5k  -  Qk)T  —  7Pq.l(°  (0k  -  Qk) 


+  l  Pri(0o)  (0k  ~  0k)T  —  L°v  Pql(°  }  et(0k) 

i=l  30*  3pqi 


(2.17) 


This  expression  can  be  further  simplified  by  utilizing  the  fact  that, 
since  ©k  is  a  maximum-likelihood  estimate  on  the  informative  sample: 


i 

5 

I 

\ 

} 


I  Pqi(0o) 

i-1 


3log  pql(0k) 
30k 


0 


Expanding  this  around  0k  yields 


(2.18) 


i 


l 


1=1 


Pqi(0O) 


3 log  pqi(0  ) 
30k 


32log  pql(0k) 
30k2 


(0k  _  0k) 


0 

(2.19) 


Using  (2.15)  and  (2.19)  and  in  (2.17)  yields  j 

Br(0k,  0k)  »  l  e  (Q  )  .I1.08  Pql(0  }  (0k  .  Qk) 

1=1  30k 

if  A,  ,  T  3“log  Pol(Qk) 

+  j  l  ei(0o)  (0k  -  0k)  - * J-311 — 1_  (0k  _  0k)  ; 

1  l=l  3©k2  \ 

~  j  l  Pqi(0o)  (0k  "  0k)T  — ^P-qi--°  }  (0k  -  0k)  ! 

1=1  30k2 

f  „  „  32log  Pn 1 ( 0k )  I 

+  l  [Pqi(0o)  +  ei(0g)]  (©k  -  0k)T  - — 3 -  ej_(O0)  (2.20) 

i=l  30*  3pqi 

where  we  have  assumed  ei(©k)  »  ^(Oq).  1 

i 

The  error  ej/Oo)  is  the  difference  of  two  probabilities,  which  are 
binomially  distributed,  by  construction: 

. 


el(Q0)  =  Pri(0o)  ~  Pqi(0Q) 


ITVVV.  V,  V.- VJ  v-J  K\|  IT.  YV-v 


J  *1 


Furthermore  £  ei(0o)  *  0,  by  definition. 

1-1 


Since  we  are  assuming  here  that  pri(©o)  and  pqi(0())  are  independent  samples 


from  the  same  underlying  distribution,  e^CGg)  is  unbiased: 


E  lei(®o)l  “  0 


(2.21) 


where  E  {  }  denotes  expectation  with  respect  to  all  underlying  random 
variables.  Recalling  that  the  informative  sample  is  of  size  nq  and  the 


predictive  sample  is  of  size  nr,  pqi(0Q)  has  approximate  variance 


var  (Pql(0O))  “-^-Pi^o)  (1  -  PiWl 
nq 


and  Pri(0Q)  has  variance 


var  (pri(0o))  *  —  Pi(Qo)  U  ~  Pi(°o)) 

nr 


var  (e±(00))  -  —  Pi(©o>  M  ~  Pi(°o)l 


(2.22) 


where  n  =  nqnr  /  (nq+nr).  The  expected  value  of  Br(0k,  0k)  can  now  be 


written  in  simplified  form  by  using 


32log  pqi(Ok) 


30k  3p  . 
qi 


Pi(0O) 


3 log  Pi(0  ) 


30k 


©f 


OK 


jf 

|sb 

m 


si 


IS' 


Vfi 


hi 
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The  result  is  that  the  expected  value  of  Br(0k,  0k)  is 


E  {Br(0*,  0*)} 

-  -  ^  P1(®0)  <°k  -  9k>T  '-\-‘°k2  P1<e  1  <5k 


-  4-  l  9  { 

n  i=l  l 


(Sk.0k)T  .  11*1  > 

30k 


(2.23) 


In  the  sequel  we  will  choose  0k  =  0*k  so  that  0*k  is  a  minimum-variance 
estimate  of  Og.  This  results  in  the  second  term  being  much  smaller  than 

the  first  term  for  reasonably  large  values  of  n/s.  We  will  explicitly 
neglect  this  term  in  the  sequel. 


Evaluation  of  Br(0o,  0*k) 


From  (2.23) 


Br(0o,  0k)  = 


1  f  ,  ,  3  log  Pl(0O) 

-  7  l  Pi(®o)  (0O  “  0k)T - - 

1  3Pi2 


-  i  (®0  -  0k)T  I(0O)  (0Q  -  0k) 


where  I(0q)  is  the  information  matrix 


(0O  -  0k) 


(2.24) 


(2.25) 


■  tvwjvv'i  '.f  sr-  »  *  <  ■>  *  J1  r. 


s  3  log  Pi(©o) 

1(00)  =  I  Pi(0o)  - : — 2 - 

3P/ 


(2.26) 


a 

In  (2.23)  both  0k  and  0k  are  k-dimensional  parameter  vectors.  Here, 
however,  0q  is  an  m- dimensional  vector  (m  >  k).  To  handle  this  situation, 
we  write  0q  -  0k  e  Rm  as 


0O  -  0k 


,k  _  Qk 


where  0gk  e  Rk,  0q  e  Rm  k 


setting 


J(0k)  =  i  <0O  -  0k)T  I(0O)  0q  -  0k) 


and  minimizing  with  respect  to  0k  yields 


0*k  =  0Qk  _  lH-l  (0O)  i12  (0Q)  0O 


(2.27) 


where  we  have  partitioned  I(0q)  as 


I(®0)  = 


1 1 1 (G0>  i12(®o) 


Il2(0o)  i22<g0> 


The  minimum  value  of  J  is 


BWPBTO  WWWVD7MIW.W WJ  WJ  vvrjryr’jrj  pj  pi  rs  *t/  tii  wrn.'v  y^i'wiu'wrv  v.  v  *r  v— 

u 


J(6*k)  »  QqT  (I22(0o)  “  i12(q0)T  Ill(0o)_1  i12(®0)1  0O 
If  we  partition  the  covariance  matrix 

P(0O)  -  K0o)_1 

P1 1 (°0>  p12(0o> 

T 

_  p12(0O>  P22(0O)_ 

then 

J(0*k)  »  |  50T  ?2 2 ( 0O ) —  1  0O 

where  p22(0o)-1  =  I22  “  I l 2T  Ill_1  r12 
Since 

s 

P(0o)  =  I  pi(0)  (0o  -  0*k)  C0o  -  0*k)T 

i=  1 

•  E  ((00  *  0*k>  (0O  ~  0*k)TI 

we  get,  finally, 

E  [J(0*k)j  .  ^  (m  -  k) 

or 

E  fBr(Q0.  0*k)l  =  j  (k-m)  (2.28) 


-25- 


2.4  Unbiased  Estimate  of  Entropy 


From  (2.23) 

E^Br(0k,  0k)j  =  -  (0k  -  0k)T  l(0k)  (Qk  -  0k) 

where  I(0k)  is  the  kxk  information  matrix 


I(0k)  =  E 


V  S  32logPi(Qk) 

j,  Pl(°0)  - 


and  p^(0q)  is  given  in  (2.6).  Using  (1.5)  and  (1.6)  we  see  that 


E  |Br(0k,  0k)}  =  -  j  tr  Ik 


k 

7 


(2.29) 


where  Ik  is  the  kxk  identity  matrix. 


Combining  (2.29)  and  (2.28)  yields 

E  |Br(0k,  0O)}  *  y  -  k  (2.30) 

where  0k  is  the  maximum  likelihood  estimate  (0k  e  Rk).  This  represents  a 
bias  in  the  maximized  log-likelihood  function,  with  the  result  that  our 
goal  is  to  pick  k  such  that 
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3.  CANONICAL  VARIATES  ANALYSIS 

We  now  consider  the  linear  prediction  problem  using  the  canonical 
variates  analysis  approach. 

Let  the  past  be  represented  as  a  column  vector  P(t)  defined  by 


P(t) 


y(t) 

y(t-i) 


nxl 


and  define  the  future  as  a  column  vector 


F(t) 


y(t+l) 

y(t+2) 


m  <  n 


mxl 

where  y(t)  is  the  r-dimensional  observed  output  at  time  t.  Our  goal  is  to 
predict  the  future  F(t)  given  P(t). 

We  now  consider  the  canonical  variate  analysis  in  a  form  that  allows  us 
to  explicitly  show  the  optimality  properties  of  the  method. 

Consider  nonsingular  transformations  of  the  past  and  future 


c(t) 

-  J 

P(t) 

(3.1) 

nxl 

nxn 

nx  1 

d(t ) 

-  L 

F(t) 

(3.2) 

mxl 

mxm 

mxl 
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and  form  a  k-—  order  estimate  of  F(t) 


^kCt)  ■  I  H  ci(t) 
1-1 


(3.3) 


where  {a^}  are  mxl  vectors  and  Is  the  i—  component  of  c  (a  scalar). 
Since  a^  Is  fixed,  only  cj.(t)  depends  on  the  data.  Since  J  is  only 
constrained  to  be  nonsingular,  we  can  use  a  very  general  form  for  it. 
Without  loss  of  generality  we  can  specify  that 


E  [c(t)  c(t)T]  -  Ir 


(3.4) 


Let  B  be  an  orthonormal  matrix: 


®nxn  Bnxn 


(3.5) 


J  Spp  JT  -  Bt  B 


where  SDD  -  E  [P(t)  P(t)T] 


This  has  a  solution 


T  -1/2 

J  -  B  Spp 


(3.6) 


(3.7) 


ct  -  Jt  P(t) 


where  Ji  is  the  i—  row  of  J; 


(3.8) 


B  -  [ b i  b2  .  .  .  bn] 

nxn 

Thus 

T  -1/2 

Ci  *  bj  ^pp  P( t ) 
and  Che  estimate  Pk(t) 

-1/2 
SPP 

A  Qk  Sp1/2  P(t) 

where 


Pk(t) 


k 

l 

1*1 


ai 


T 

bl 


Qk  *  l  al  bl 
1*1 

Note  that  Qk  has  maximum  rank  k. 
The  prediction  error  Is 

-1/2 

ek(t)  -  Qk  Spp  P(t)  -  F(t) 


We  now  form  a  quadratic  cost  function 


I 

$ 

a 

i 

« 

i 

s 

a 


Lk  ”  E  tek(c)  w  ek(t) ] 


-1  .  -1/2  T  -1/2  T  T 

-  tr  W  E  {[Qk  Spp  P(t)  -  F( t ) ]  [P(t)  Spp  Qk  -  F( t ) ] } 

-  tr(W  1  Qk  Qk) 

-1  -1/2  -1 
-  2tr(W  Qk  Spp  Spf )  +  tr(W  Sff) 


where  Spf  =  E  [P(t)  F(t)T],  Sff  =  E  [F(t)  F(t)T] 

In  order  to  handle  the  orthonormality  constraints  we  add  the 
equations  via  Lagrange  multipliers  to  form  the  augmented  cost 


v  T 

Lk  =  Lk  +  I  xi  0>i  bi  -1) 
i=  1 

where  { }  are  Lagrange  multipliers.  Thus 


-  -1  k  x  n  T 

Lk  “  tr  {w  l  ai  bt  l  bj  as) 

1-1  J-I 

- 1  k  T  -1/2 

-  2  »  l«  l  ai  bt  Spp  Spf} 
i=l 


+  tr  {w  Sff} 


*  T 

+  [  xi  0>i  bf  -1) 

i=l 


T 

Using  bf  bj  =  <S ^ j ,  with  6  the  Kroneker  delta  function, 
and  rearranging  gives 


(3.15) 


constraint 


(3.16) 


(3.17) 


(3.18) 
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k  T  -1 
Lk  "  L  ai  W  at 
i-1 

k  T  -1/2  -1 

-  2  l  bf  Spp  Spf  W  af 

i»  1 


-1  k  T 

+  tr(W  Sff)  +  l  XiCbi  bi  -1) 
i  =  l 


(3.19) 


Taking  partial  derivatives: 


3Lk  T  -1  T  -1/2  -1 

3a .  =  ^  ai  w  ~  2  b^  Spp  Spf  W 


T  -1  T  -1/2  T 

-2  af  W  Spf  Spp  +  2  Xj  b[ 


Thus,  the  first  order  necessary  conditions  for  minimizing  Lk  are 


*  T  -1/2  * 

ai  =  spf  spp  bi 


*  -1/2  -1  * 
*i  bi  =  Spp  Spf  W  ai 

for  i  *  1,  2,  k. 


Eliminating  af; 


(3.20) 


(3.21) 


(3.22) 


(3.23) 


^i  bf  =  Spp  Spf  W  Spf 
which  is  an  eigenequation. 


-I  T  -1/2  * 


(3.24) 


J. 


WA 


The  first  term  of  (3.L9)  becomes 


*  *T  -i  * 
l  at  w  ai 
1-1 


k  *T  -1/2  -1  T  -1/2  * 

*  A  bf  Spp  Spf  W  Spf  Spp  bf 


k 

*  l  xi 

1-1 


The  second  term  of  (3.19)  becomes 


k  *T  -1/2  -1  T  -1/2 

-2  l  bf  Spp  Spf  W  Spf  Spp 

1-1 


k 

-  -2  l  H 

1=1 


Thus,  the  optimized  cost  Is 


*  -1 
Lk  =  tr(W  Sff) 


k 

I 

1  =  1 


Now  let 


-1/2  -1/2 

k  =  Spp  Spf  w 


(nxm)  n>m 


From  (3.24), 


★X  "  * 

bf  R  R  bj  =  Af 


By  using  a  singular  value  decomposition  on  R 


ll' 


R  =  U  D  V 


T  T 

V  V  =  I,  U  U  =  I 


Yl  0 


where  Yi  >  Y2  >  •  •  •  Ym 


T  T  T  T 

Then  RR  =UDV  VD  U 


T  T 

=  U  D  D  U 


Then,  from  (3.29) 


*T  T  T  * 

b±  U  D  D  U  bt  =  Xt 


Now  let 


U  =  [Ui  U2  .  .  .  Un ] 


(3.30) 


(3.31) 


(3.32) 


(3.33) 


(3.34) 


Thus  bj[*  is  the  eigenvector  of  U  D  whose  eigenvalue  is  X^. 


(3.35) 


where  the  are  mutually  orthogonal  unit  vectors  by  construction.  But  the 

matrix  U  D  [jT  has  eigenvectors  U|  and  associated  eigenvectors  Y^2  since 


UtT  U  D  DT  Ut  Uj  =  Yi2 


(3.3b) 


w 


s? 
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LI 


ki 


1^1 


*  T  i 

”  spf  spp 

To  determine  L  (cf  (3.2)),  we  can  use  the  condition 


E  (cd^)  »  D 


(3.43) 


J*Snf  LT  -  D 


(3.44) 


(3.45) 


From  (3.7), 


t  “i/2 

uT  spp  Spf  lT  *  D 


(3.46) 


D  -  UT  R  V 


T  -1/2  -1/2 
-  uT  spp  Spf  W  V 

Comparing  (3.46)  and  (3.47)  gives 


T  ’I/2 
Lt  -  W  V  ,  or 


L  -  VT  w 


(3.47) 


(3.48) 


* 

Note  that  A^,  the  optimal  gain  matrix  is  of  dimension  mxn  but  has  a  maximum 
rank  of  k. 

Note  that  k  <  m  since  the  symmetric  matrix  in  the  eigenequat ion 
(3.24)  has  rank  <  m.  This  is  very  important,  as  it  implies  that  we  need  to 
make  the  dimension  of  the  future  vector  (m)  at  least  as  large  as  the  maxi¬ 
mum  expected  order  of  the  estimator. 


V, 

¥ 

¥ 

•j« 

$ 

i 


,v 

» 


I 
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An  efficient  computation  of  is 


-1/2  -1/2 
Spp  ;  Spp  symmetric 

X  * 
spf  di 

k  *  *  x 

l  ai  di 
i=l 


Cholesky  Form 


The  cholesky  factorization  of  a  positive-definite  matrix  is  an  attrac¬ 
tive  way  of  computing  a  square  root  matrix.  Let 


Spp'1  -  Q  Q 

Then  we  get  the  following  relations 


*  T 

ai  -  spf  Q  ui 

*  T  -I  T 

Xt  Ui  =  Q  Spf  W  Spf  Q  Ui 


T  -1/2 

R  =  Q  Spf  W 


*  X  k  x  T 

Fk(t)  *  Spf  Q  (  l  Ui  Ut)  Q  P(t) 
i-1 


*  x  k  x  T 

Ak  =  Spf  Q  (  l  Uf  Ui)  Q 
i«l 


Trucated  Predictor 


In  the  sequel,  we  will  be  restricting  the  total  number  of  parameters 
allowed  in  the  predictor.  The  question  arises  as  how  to  best  truncate  the 
prediction  equations.  Our  approach  is  to  use  only  the  most  recent  past 
values.  For  example,  suppose  we  have  used  m  =  5  in  our  analysis,  but  wish 
only  to  use  a  one-step-ahead  predictor  with  k  parameters.  Then  our  predic- 

* 

tor  uses  only  the  first  k  elements  of  the  first  row  of  A^. 

Inclusion  of  Known  Inputs 

If  we  have  an  unknown  system  with  measured  outputs  y(t)  and  measured 
inputs  u(t),  the  analysis  of  this  section  holds  with  only  slight  modifica¬ 
tions.  If  we  augment  the  past  vector  as 


P(t)  * 


y(t) 

u(t) 

y(t-i) 

u(t-l) 


(3.49) 


then  all  of  the  analyes  of  this  section  holds  and  the  predicted  values  of 
y(t)  depend  on  both  past  values  of  y(t)  and  on  past  values  u(t). 
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4.  DIRECT  DETERMINATION  OF  STATE-SPACE  MATRICES 

We  now  consider  the  problem  of  determining  the  state-space  matrices 
directly  from  the  linear  prediction  solution.  Recall  from  Section  3: 


P(t)  = 


y(t) 

y(t-i) 


(4.1) 


nxl 


F(t)  =  A  P(t)  (4.2) 

If  we  restrict  our  problem  to  a  one-step-ahead  prediction  of  y(t),  then 


y(t+l  |  t)  =  A  P(t) 
where  A  is  mxn. 


(4.3) 


We  can  write  a  recursion  for  P(t)  as  follows: 


P(t+1)  =  M  P(t)  +  T  y(t+l) 
where 


(4.4) 


M  = 


. 0  1 

. 0 

0  I  0 . 0 


0  0 
I  0 


0 . 0  10 

where  all  submatrices  are  mxm. 


(4.5) 
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‘■mxm  I  (^*6) 

^(n-m)xm 

so  that  P(t)  is  in  recursive  form  with  a  driving  term  y(t+l). 

The  state-space  formulation  employs  a  Kalman  filter  for  updating.  The 
equations  for  the  time-invariant  filter  at  steady  state  are 


y(t+l  j  t)  =  H  x(t+l  J  t)  (4.7) 

x(t+l  |  t)  =  #  x(t  j  t)  (4.8) 

x(t+l  |  t+1)  =  x(t  +  l  ]  t)  +  K  [ y( t  +  1 )  -y( t+1  j  t)]  (4.9) 

Combining  (4.7)  -  (4.9)  yields 


y(t+l  J  t)  -  H  *  x(t  |  t)  (4. 10) 

x(t+l  j  t+1)  =  ( I-KH)  *  x(t  |  t)  +  K  y( t+1 )  (4.11) 

as  the  state-space  equation  set.  The  linear  prediction  set  is 


y ( t+1  |  t)  =  A  P(t)  (4.12) 

P(t+1)  =  M  P(t)  +  T  y( t+1 )  (4.13) 


What  we  seek  to  do  is  match  these  two  pairs  of  equations  by  finding  the 
state-space  matrices  H ,  4> ,  K  which  give  the  best  "fit"  to  the  linear  pre¬ 
diction  equations. 

Equations  (4.7)  -  (4.9)  can  also  be  put  in  the  form 
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I 


B 


y(t+l  t)  ■  H  x(t+l  t) 


x(t+l  |  t)  -  «(I  -  KH)x( t  |  t-1)  +  ♦  K  y(  t ) 


(A. 14) 


(4.15) 


We  can  solve  Che  problem  operationally  by  using  solutions  Involving  delay 
operators. 


We  will  first  solve  the  linear  prediction  equations.  From  (4.13), 


P(t+1)  »  zM  P(c+1)  +  T  y( t+l ) 
where  z  is  the  delay  operator:  z  P(t+1)  =  P( t ) 


Solving  (4.16): 


P(t+1)  *  ( I-zM)  T  y( t+1 ) 

Thus,  from  (4.12): 


(4.  16) 


(4.17) 


y(t+l  |  t)  »  A  (I-zM)  T  y(t) 

We  can  also  solve  the  state-space  equations  in  the  same  way. 


From  (4.10)  and  (4.11)  we  get 


(4.18) 


x ( t  t)  -  [I  -  ( I-KH)  *zj  K  y(t) 


(4.19) 


so  that 


y( t+1  t)  -  H  *  [I  -  (I-KH)  tz]  K  y(t) 


(4.20) 


•WJ 

■>.  X.'  V 
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‘.OriHar 
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while  from  (4.14)  and  (4.15)  we  get 


(4.22) 


.  -1 
y(t+l  I  t)  -  H  [I  -  4>(I-KH)z)  *  K  y(t) 

If  we  could  get  a  perfect  match  between  the  linear  prediction  and  the 

state-space  predictors,  then  the  following  equation  would  be  satisfied 


-1  -1 
A  (I-Mz)  T  =  H  4>[I-(I-KH)$z]  K 

or,  equivalently 


(4.23) 


-1  -1 

A  (I-Mz)  T  *  H[I-4>(I-KH)z]  <t  K  (4.24) 

Using  (4.23),  we  see  that  exact  matching  occurs,  for  dim  (x)  =  n,  if 


'  A  -  H  4 
-  M  =  (I-KH)  D 
T  =  K 

Equation  (4.26)  can  be  written  as 


(4.25) 

(4.26) 

(4.27) 


M  *  *  -  TA, 
so  that 


(4.28) 


*  -  M  +  TA  (4.29) 


In  addition,  (4.25)  gives 


H  -  A  *  _1  (4.30) 


Since  H  must  satisfy  A  =  H  (M  +  TA) ,  it  is  easily  shown  that  H  is  in  cano- 


^  “  t^mxm  ®mx(n-m)] 


If  4  is  a  valid  transition  matrix,  it  is  guaranteed  to  be  invertible.  T 
we  only  need  to  guarantee  the  invertibility  of  M  +  TA.  Using  the  defini 
tion  of  M  and  partitioning  T  and  A  appropriately,  we  see 


nxn 


Omx(n-m)  ®mxm 


^(n-m)x(n-m)  0(n-m)xm 


Amxm 

®(n-m)xm 


A1  a2 

1  0 

The  inverse  is 


-1 


L  a2 


-1 


-1 

-a2  aU 


Thus  4  *  exists  if  A2  *  exists. 


Almx(n-m)  A2mxnTj 


(4.31) 


(4.32) 


To  check  this,  write  equation  (4.15)  in 


partitioned  form  as 


[  A1  a2  1 


L  "12  "22j 


T  T 

a2  "  spfl  w12  +  spf2  w22 

Now  partition  SDD  as 


S11  s12 


>PP  "  T 

L  S12  s22 


-1  T  -1  -1 

w12  “  ~  S11  S1 2  (s22  -  s12  S11  s12> 

T  -1  -l 
w22  *  (s22  ”  s12  sll  s12) 


so  that 


a2  “  (spf2  ~ 


T  -1 

spfl  S11  S12>  <s22 


T  -I  -I 

s12  Sii  S12) 


Therefore 


-1  T  -1  T  T  -1  -1 

a2  “  (s22  “  s12  S11  S1 2 )  (spf2  "  spfl  S11  Sl2) 

Solving  for  yields 


T  T  T 

Spfl  W11  +  Spf 2  w12 


Using 


(4.35) 


(4.36) 


(4.37) 


(4.38) 


(4.39) 


(4.40) 


(4.41) 
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-1  -1  T  -1  -IT  -1 

W11  *  S11  +  S11  s12  ( s22  _  s12  S11  s12>  s12  S11  (4-42) 


we  get 


T  -1 

A1  *  spfl  S11  + 


t Spf 1  S11  s12  "  spf 2 1  ( s22  ~  S1 2  S11  s12>  s12  SIL  (4*43) 

In  summary,  we  can  use  the  direct  solution  if  the  matrix 


T  T  -1 

^ce  =  Spf2  -  Spfi  S[i  S i 2  (4.44) 

is  invertible. 


This  exact  solution  is  restricted  to  the  case  dim  (x)  =  n.  This 
implies  that 


dim(x( t ) )  =  dim(P( t )  ) 


Truncated  Filter 

Given  the  matrices  for  the  full-order  Kalman  filter  't’nxn*  Hmxn,  Knxm> 
the  question  arises  as  to  whether  there  is  a  suitable  truncation  to  a 
lower-oder  form  required  to  meet  restrictions  on  the  total  number  of  para¬ 
meters.  Using  the  forms  of  (4.27),  (4.29)  and  (4.30)  and  assuming  an  order 
k  <  n,  and  prediction  of  the  first  p  future  values  (p  <  m) ,  we  truncate  as 
follows : 


(1)  *kxk  is  the  upper  left  kxk  submatrix  of  $nxn 

(2)  Hp^  is  the  upper  left  pxk  submatrix  of  Hmxn 

(3)  KkXp  is  the  upper  left  kxp  submatrix  of  Knxm 
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5.  MODEL  SELECTION  FOR  LINEAR  PREDICTION 


We  can  now  consider  the  problem  of  model  order  selection  for  linear 
prediction  problems  using  canonical  variates  analysis.  Assume  that  the 
data  y(t)  are  Gaussian  and  consider  data  on  the  interval  [t,  s)  t<s  with 
unit  time  increment  and  let  d  =  s  -  t+1.  The  associated  data  over  the 
interval  are 

Y(t,s)  =  (y(t),  y( t+1 ) ,  ....  y( t+d-1 ) }  (5.1) 

The  one-step-ahead  prediction  error 

ek(t)  =  y C t+1 )  -  Ak  p(t)  (5.2) 

is  also  Gaussian  and,  from  (3.42)  has  zero  mean  and  variance 


E  tek(t)  ek( t )  > 


“  Sk(t  +  1  |  t) 

T  -1/2  k  -1/2 
“  Sff  -2  Spf  Spp  U  Spp  Spf 


T  -1/2  k  2  -1/2 

+  ^pf  Epp  (^  )  Spp  Spf 


(5.3) 


where 


k  £  T 

U  -  l  Ut  Ui 
i-1 


(5.4) 


Since  (U^)2  »  (5.3)  redu  ces  to 
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t+l  I  =  ^ff  ^pf  ^pp  ^  ^pp  ^pf  (5.5) 

Since  Sk(t+1  |  t)  Is  constant  for  all  t  due  to  assumed  stationarity  of 
y(t),  we  can  set  Sk  =  Sk(t+L  |  t). 

From  (3.3)  the  number  of  parameters  is  mk.  Thus 


AIC(k) 


d  T  - 1 

1  l°g  |  sk  |  +  ek(t)  sk  ek<t) 
t  =  l 


+  2  mk 


"  d [ log  |  Sk  |  +  tr  Im]  +  2  mk  (5.6) 

where  |  .  |  denotes  determinant  and  Im  is  the  mxm  identity  matrix.  The  AIC 
is  thus 


AIC(k)  =  d  log  |  Sk  j  +  m  +  2  mk  (5.7) 

Since  d  and  m  are  fixed,  we  see  that  the  optimum  order  k  is  the  one  which 
minimizes 


log  |  Sk  |  + 


2mk 

d 


(5.8) 


for  d  >  >  2(m)k  this  is  approximately  equivalent  to  minimizing 


Sk 


1  + 


2  mk 
~d 


(5.9) 


which  can  be  recognized  as  the  forward  prediction  error  (FPE),  see  Akaike 
(1973).  In  the  sequel  we  will  use  (5.8)  instead  Of  (5.9)  since  the  data 


interval  will  sometimes  be  relatively  short. 


6.  DETECTION  OF  ABRUPT  MODEL  CHANGES 

In  this  section  we  apply  the  tools  developed  so  far  to  the  problem  of 
abrupt  change  detection.  We  then  present  some  experimental  results. 

6. 1  Algorithm  Development 

Consider  the  situation  depicted  in  Figure  6.1,  in  which  the  true  time- 
series  model  changes  from  Mq  to  Mj  at  time  t-dj,  where  t  is  the  present 
time.  Suppose 


Figure  6.1  Changing  Time  Series  Model 


that  we  have  data  back  to  time  t-do  and  that  the  true  model  is  Mq  in  the 
interval  (t-dQ,  t-dj). 

We  wish  to  detect  this  change  in  the  model.  In  this  example,  fitting  a 
single  model  to  data  over  the  interval  (t-dg,  t)  should  result  in  greater 
fit  errors  than  fitting  one  model  over  the  interval  (t-do,  t— d  j )  and 
another  model  over  the  interval  (t-dj,  t).  The  crucial  issue  is  to 


determine  an  appropriate  selection  measure  so  as  to  be  sensitive  to 


changing  models,  while  at  the  same  time  not  being  too  sensitive  to  noise. 
Over-sensitivity  to  noise  will  result  in  deciding  that  model  changes  have 
occurred  when,  in  fact,  they  have  not.  Low  sensitivity  to  model  changes 
will  result  in  missing  changes  which  have  occurred  in  the  model.  Another 
obvious  problem  is  how  best  to  select  the  testing  intervals,  (t-dg,  t)  and 
(t-di,  t) ,  to  minimize  the  time  required  to  achieve  accurate  detection.  We 
consider  first  uhe  selection  measure. 

Over  the  interval  (t-do?  t)  we  can  find  the  model  which  minimizes  the 

AIC: 

d°  „k 

AIC  (k)  =  -  2  l  log  p(e( t-d0  +  i)  |  0  )  +  2  M(k)  (6.1) 

i=l 

where  M(k)  is  the  number  of  independently  adjustable  parameters  and  where  we 
have  assumed  a  sampling  time  increment  of  one,  for  convenience. 

If  we  now  divide  the  interval  (t-d()>  t)  into  two  subintervals, 

(t-do,  t-di)  and  (t-dj,  t) ,  we  determine  minimum  AIC  models  for  each  subin¬ 
terval 


AIC0( k) 


2 


0  U1 

l 

i-1 


log  p(e(t-dQ  +  i)  |  0  )  +  2  M(k) 


(6.2) 


Aiq(k) 


2 


u  „k 

l  log  p(e(t-dg  +  i)  |  0  )  +  2  M(k) 

i=do~d [+1 


(6.3) 


Now  assume  that 


k1 


arg  min  AIC(k) 


kQ*  ■  arg  min  AICo(k) 
kj*  =  arg  min  AICi(k) 

-k*  -  kg*  -ki* 

and  let  the  corresponding  models  be  parametrized  by  0  ,  0q  » 

respectively. 

Then  the  model  selection  criterion  is  based  on  comparing  AIC(k*)  with 

AICg(kg*)  +  AIC(kj*)  and  selecting  the  model(s)  which  give  the  least  value 

We  can  simplify  the  calculation  in  the  case  that  dg  >  >  d^  and  the  model 

does  not  change  too  much,  in  which  case  we  expect  that  k*  “  kg*, 
k*  kn* 

0  *  Oq^  •  *n  this  case  we  can  define  the  AIC  difference  as 

AAIC*  =  AIC(k*)  -  AICg( kg* )  -  AlC^ki*) 

d°  .k* 

=  -  2  l  log  p(e( t-dg  +  i)  |  0 

i=dg-di+l 

d°  k,* 

+2  l  log  p(e( t-dg  +  i)  |  0L  )  -  2  M(kl*)  (6.4) 

i=dg-d i+l 


and  the  decision  rule  is 


AAIC* -I 


<  0 
>  0 


;  declare  "no  change" 
;  declare  "change" 


Note  that  AAIC*  may  be  written  as 


(6.5) 


J0 


AAIC*  =  2  l  log  P(e(t  d-0+1'>  I  - -  -  2  M(kj*) 

i= ldg-d i+ 1  p(e(t-dQ+i)  j  gk*  ) 


(6.6) 
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which  is  the  likelihood  ratio  in  favor  of  the  best  model  in  the  interval 
(t-dj,  t)  to  the  best  historical  model  evaluated  over  the  same  interval, 
but  biased  off  by  the  number  of  parameters  of  the  best  model  in  the  inter¬ 
val  (t-dj,  t). 


If  we  specialize  this  result  to  the  linear  prediction  problem  under 
study  here,  we  see  that 


AAIC*  -  di  {log  |  S(k*)  j  +  tr  S(k*)  S(k*)  l} 
-  di  {log  |  SjCki*)  |  +  m} 

-  2  mki* 


(6.7) 


where  S(k*)  is  the  theoretical  covariance  matrix  of  prediction  errors  for 
the  historical  model  fitted  on  the  interval  (t-do,  t-dj),  SjCkj*)  is  the 
theoretical  covariance  matrix  of  prediction  errors  for  the  model  fitted  to 
the  data  on  the  Interval  (t-dj,  t)  ,  and  S(k*)  is  the  actual  covariance 
matrix  of  prediction  errors  for  the  historical  model,  evaluated  on  the 
interval  (t-di,  t).  Now  let  AS(k*)  =  S(k*)  -  S(k*). 


AAIC*  =  dj  log 


S(k*) 

sl(ki*) 


-2  mkj 


+  tr  AS(k*)S(k*) 


Thus  our  decision  parameter  is 


(6.8) 


Y  -  log 


S(k*) 

Vkl*) 


2  mki 
dl 


+  tr  AS(k*)S(k*) 


(6.9) 


and  the  decision  rule  Is 


f  <  0  ; 

l  >  0  ; 


declare  "no  change' 
declare  “change" 


(6.10) 


6.2  Experimental  Results 


The  abrupt  change  detector  was  tried  on  a  changing  autoregressive 
model.  On  the  interval  t  e  [l,  dg  ]  ,  the  actual  model  was 


y(t)  =  1.65  y(t-l)  -  0.665  y(t-2)  +  u(t) 


(Model  1) 


(6.11) 


where  u(t)  was  zero-mean  white  Gaussian  noise  with  variance  of  1.  This 
model  has  two  real  stable  poles  at  0.95  and  0.7.  The  actual  model  was  then 
changed  to 


y(t)  =  2.5  y(t-l)  -  2.11  y(t-2)  +  0.595  y(t-3)  +  u(t)  (Model  2)  (6.12) 

on  the  interval  t  e  [dg  +  1,  dg  +  d , ] .  This  model  has  three  poles  at 
0.7,  0.9  +  0. 2i ,  0.9  -  0.2i. 


The  first  trial  used  dg  =  80,  dj  =  20.  The  resulting  covariance  matri¬ 
ces  on  the  interval  [1,  dg]  were 


11.0808 

10.9642 

10.7662 

10.5378 

10.2860 


10.9642 

11.0018 

10.8992 

10.7144 

10.4733 


10.7662 
10.8992 
10 . 9484 
10.8566 
10.6614 


10.5378 

10.7144 

10.8566 

10.9144 

10.8144 


10.2860 

10.4733 

10.6614 

10.8144 

10.8619 


MMX 


1 


m 


-ryrV^V^ 


11.0439 

10.8318 

10.5900 

10.3510 

10.0978 


10.9089 

10.6534 

10.4015 

10.1607 

9.9061 


10.7124 

10.4500 

10.1993 

9.9542 

9.6859 


11.1612 

11.1214 

10.9673 

10.7430 

10.4764 


11.1214 

11.2356 

11.1760 

10.9937 

10.7336 


10.9673 

11.1760 

11.2697 

11.1831 

10.9711 


By  performing  an  SVD,  we  obtain 


0.6655 

0.4239 

0.3888 

0.3596 

0.3112 


0.4685 

■0.2456 

0.2135 

-0.1033 

-0.8148 


-0 . 4465 
0 .7166 
0.3803 
-0 . 1198 
-0 .3579 


7 .2182 
0 
0 
0 
0 


0 

0.1863 

0 

0 

0 


0 

0 

0.1032 

0 

0 


10.4822 
10.2258 
9 . 9753 
9.7122 
9.4296 


10.7430 

10.9937 

11.1831 

11.2537 

11.1474 


0.3351 
0.4911 
■0.7320 
•0 . 3149 
■0  .  1072 


0 . 0207 
0 


10.2222 
9.9723 
9 .7098 
9 . 4267 
9.1241 


10.4764 

10.7336 

10.9711 

11.1474 

11.2135 


-0.1607 
0 . 0726 
-0 . 3504 
0 .8640 
-0.3157 


0 

0 

0 

0 

0 .0165 


ivi 

r» 


& 


»» 
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0.4605 

0.4569 

0.4493 

0.4398 

0.4289 


-0.6945 

-0.2407 

0.0706 

0.3337 

0.5860 


0.5094 
-0.4337 
-0. 5097 
-0.0833 
0.5344 


-0. 1666 
0.4939 
-0.0174 
-0.7382 
0.4279 


0.1355 
-0.5489 
0.7301 
-0.3787 
0 . 0627 


The  resulting  values  of  AIC(k)  for  different  orders  k  are,  neglecting 
constants , 


AIC(l)  =  -  2.150 

AIC(2)  -  -  2.264  — - 

AIC( 3 )  =  -  2.243 
AIC(4 )  =  -  2.195 

so  that  k*  =  2,  which  is  the  correct  order,  is  selected, 
model  is  y(t)  =  1.843  y(t-l)  -  1.0081  y(t-2). 


The  estimated 


On  the  interval  [do  +  1,  do  +  d I  ,  the  covariance  matrices  were 


1.7647 

1.7624 

1.6008 

1.3214 

1 . 1104 

1.7624 

1.9451 

1 . 9171 

1. 7039 

1.4781 

1.6008 

1.9171 

2.0772 

2.0067 

1.8375 

1.3214 

1.7039 

2.0067 

2  .  1332 

2.0981 

1.3214 

1.7039 

2.0067 

2  .  1332 

2.0981 

'  1.6394 

1.4902 

1.4677 

1 . 6660 

2  .  1622 

1.4905 

1.2481 

1.1833 

1 . 3743 

1.8518 

1.2626 

1.0238 

1.0057 

1 .2397 

1  .  7310 

0.9897 

0.8111 

0,8554 

1 .  1288 

1.6058 

.  0.8331 

0.7084 

0.7783 

1.0178 

1.4219 

* 

w 


f  f2 


1.7259 

1.7733 

1.9101 

2.2212 

2.7981 

1.7733 

2.1250 

2.5776 

3.1739 

3.9927 

1.9101 

2.5776 

3.4770 

4.5339 

5.7802 

2.2212 

3.1739 

4.5339 

6.1817 

8.0268 

2.7981 

3.9927 

5.7802 

8.0268 

10.5912 

Performing  the  SVD  yields. 


0.9394 

-0.1382 

-0.2689 

0.0560 

-0.1513 

0.0073 

-0.8145 

0.4181 

0.3600 

0.1791 

0.1124 

-0.1507 

0.5108 

-0.7606 

-0.3538 

0.2936 

0.5410 

0.6957 

0.3074 

0.2065 

0.1361 

-0.0455 

-0.0891 

-0.4408 

0.8816 

3.5041 

0 

0 

0 

0 

0 

0.6770 

0 

0 

0 

0 

0 

0.2473 

0 

0 

0 

0 

0 

0.0326 

0 

0 

0 

0 

0 

0.0094 

0.3352 

-0.7554 

0.4483 

-0.2558 

0.2250 

0.3611 

-0.3926 

-0.3766 

0.6815 

-0.3305 

*  4 

0.4033 

-0.0185 

-0.6095 

-0.6546 

-0.1926  . 

0.4766 

0.2819 

-0.1621 

0.2039 

0.7909 

0.6062 

0.4422 

0.5093 

0.0107 

-0.4213 

•> 

The  resulting 

values  of 

AIC(k)  are, 

again  neglecting  co 
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AIC(I)  -  -  0.960 

AIC(2)  -  -  2.266 

AIC(3)  =■  -  2.323  — - 

AIC(4)  -  -  2.223 

AIC(5)  -  -  2.L23 

so  that  k*  =  3,  as  desired.  The  estimated  model  is 

y(t)  =  1.9456  y(t-l)  -  1.1485  y(t-2)  +  0.0483  y(t-3) 

Note  that,  with  the  sparse  amount  of  data  available,  the  coefficient 
errors  are  relatively  large  and  the  two  estimated  models  are  relatively 
close  to  each  other. 

The  AAIC  criterion  was  used  to  test  for  a  change  in  the  time  series 
coefficients.  Since  we  have  only  one  output,  the  criterion  is 


S(k*) 

Sl(ki*) 


S(k*)  -  S(k* ) 
S(k*) 


(6.13) 


Using  k*  =  2,  kL*  =  3,  S(k*)  =  .0940,  9^^*)  =  .0726  ,  S(k*)  =  .0903, 
di  =  20  yields, 


AAIC  =  0.2583  -  0.0394  -  0.3  =  -  0.0811 
so  that  a  "no  change"  decision  is  made,  but  just  barely.  Note  that  the 
actual  covariance  on  the  second  interval  using  Model  #1  is  actually  less 
than  for  the  first  interval,  as  a  result  of  using  only  a  small  testing 
interval. 

We  next  tried  the  test  over  larger  intervals,  keeping  a  4:1  ratio  bet¬ 
ween  the  historical  interval  and  the  testing  interval.  The  intervals  used 
were  160  for  the  historical  interval  and  40  for  the  testing  interval. 


The  covariance  matrices  for  the  historical  interval  were 


6.4177 

6.3504 

6.2021 

6.0182 

5.8335 

6.3504 

6.4177 

6.3504 

6.2022 

6.0183 

6.2021 

6.3504 

6.4170 

6.3494 

6.2006 

6.0182 

6.2022 

6.3494 

6.4153 

6.3470 

5.8335 

6.0183 

6.2006 

6.3470 

6.4119 

6.3504 

6.2014 

6.0149 

5.8244 

5.6495 

6.2013 

6.0147 

5.8242 

5.6493 

5.4869 

6.0169 

5.8282 

5.6544 

5.4919 

5.3310 

5.8316 

5.6606 

5.4999 

5.3387 

5.1657 

5.6655 

5.5089 

5.3502 

5.1770 

4.9871 

6.4187 

6.3529 

6.2057 

6.0202 

5.8294 

6.3529 

6.4251 

6.3642 

6.2194 

6.0332 

6.2057 

6.3642 

6.4436 

6.3856 

6.2387 

6.0202 

6.2194 

6.3856 

6.4671 

6.4058 

5.8294 

6.0332 

6.2387 

6.4058 

6.4835 
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The  results  of  the  SVD  were 


0.6995 

0.3997 

0.3573 

0.3481 

0.3197 


0.4349 

-0.7769 

-0.0762 

0.3521 

-0.2785 


-0.2706 

0.1851 

0.5794 

0.3443 

-0.6620 


0.4797 
0.3162 
-0.3109 
-  0 . 4460 
-0.6118 


0.1353 

■0.3202 

0.6589 

-0.6613 

0.0879 


5.3574 

0 

0 

0 
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0 
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v  = 


0.4693 

-0.8139 

0.1915 

-0.2626 

0.1082 

0.4616 

-0.1011 

-0.5037 

0.6123 

-0.3847 

=  ^ 

0.4489 

0.3182 

-0.4470 

-0.2359 

0.6647 

0.4342 

0.3743 

0.1072 

-0.5518 

-0.5962 

0.4204 

0.2933 

0.7059 

0.4426 

0.2075 

The 

values  of 

AIC  (k) 

were 

AIC(l) 
AIC( 2 ) 
AIC(3) 
AIC(4 ) 
AIC( 5) 


2.3072 

2.4871 

2.4795 

2.467 

2.4545 


so  that  k*  =  2  is  selected.  The  estimated  model  is 


y(t)  =  1.6777  y(t-l)  -  0.7178  y(t-2) 

which  is  much  closer  to  the  actual  model  (6.11),  due  to  the  increased  data  length. 


The  model  over  the  testing  interval  was  next  found.  The  covariance  matri¬ 
ces  were 


PP2 


20.6940 

20.2655 

19.2378 

17.6822 

15.7286 


20.2655 

20.4909 

20.0699 

19.0219 

17.4532 


19.2378 

20.0699 

20.3055 

19.8666 

18.8081 


17.6822 

19.0219 

19.8666 

20.0831 

19.6334 


15.7286 

17.4532 

18.8081 

19.6334 

19.8397 


pf  2 


20.4906 

19.4580 

17.9266 

15.9897 

13.7888 


19.6758 

18.1453 

16.2349 

14.0540 

11.7306 


18.3328 

16.4322 

14.2789 

11.9786 

9.6332 


16.5805 

14.4431 

12.1690 

9.8473 

7.5625 


14.5504 
12  .  2933 
9.9935 
7 . 7296 
5 . 5657 
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'  20.9398 

20.7220 

19.8651 

18.4736 

16.6762 

20.7220 

21.1467 

20.8735 

19.9608 

18.5264 

s  =  * 

19.8651 

20.8735 

21.2276 

20.8927 

19 . 9440 

f  fZ 

18.4736 

19.9608 

20.8927 

21.1853 

20.8230 

.  16.6762 

18.5264 

19.9440 

20.8230 

21.0965 

The 

SVD  yielded 
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0.0935 
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U2 

—  * 

0.2259 

■0.4936 
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0.4613 
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-0.6807 
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0 
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0 
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0 

0 
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0 

0 

0 

0.0158 

0 

0 

0 

0 

0 

0.0117 
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l 

0.4557 

-0.6682 

0.5738 

-0.1019 

0.0784 

0.4670 

-0.2754 

-0.5643 

0.6055 

-0.1447 

V  =  « 

0.4617 

0.0760 

-0.4483 

-0.6411 

0.4112 

z 

0.4410 

0.3630 

0.1371 

-0.2325 

-0.7752 

0.4081 

0.5832 

0.3640 

0.3974 

0.4504 

The  resulting  values  of  AIC(k)  were 


AIC(l)  =  0.3771 
AIC( 2 )  =  -  2.7253 
AIC(3)  -  -  2.7595 
AIC(4)  =  -  2.6753 
AIC(5)  =  -  2.6253 


so  that  kj*  =  3  was  selected.  The  resulting  estimated  model  was 


y(t)  =  2.3931  y(t-l)  -  1.977  y(t-2)  +  0.7421  y(t-3) 
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which  is  much  closer  to  the  actual  model  (6.12)  than  the  estimate  based  on 
half  as  many  data  points. 

The  AAIC  criterion  was  then  applied  to  test  for  a  model  change.  Using 
(6.13)  with  k*  =  2,  k^  =  3,  S(k*)  =  0.0811,  S^k!*)  =  0.0564, 

S(k*)  =  0.1119,  we  get 

AAIC  =  0.3632  +  0.3801  -  0.15  =  0.5933  (6.14) 

which  yields  a  "change"  decision.  By  comparing  (6.14)  to  (6.13)  we  note 
several  things.  The  first  term,  which  is  log  S(k*)  -  log  S i ( k j * )  now  more 
strongly  indicates  a  change,  due  to  better  model  fit.  The  second  term, 
which  is  the  effect  of  modeling  error  on  the  measured  error  covariances, 
also  more  strongly  indicates  a  change  due  to  increased  data  length,  which 
produces  a  more  accurate  estimate  of  the  true  error  covariance  during  the 
testing  interval,  using  the  “no  change"  hypothesis.  Finally,  the  last  term 
more  strongly  indicates  a  change,  since  the  bias  for  a  "no  change"  decision 
is  reduced  due  to  increased  data  length.  Thus  we  see  that  all  three  terms 
in  the  AAIC  criterion  contribute  to  the  final  decision,  and  each  one  is  of 
importance  in  achieving  an  accurate  decision. 


7.  DETECTION  OF  SLOW  MODEL  CHANGES 


We  now  consider  the  detection  of  slow,  essentially  continuous,  model 
changes.  What  we  wish  to  achieve  in  this  case  is  an  appropriate  data 
length  over  which  to  fit  models.  If  the  data  length  is  too  short,  then 
there  will  be  a  tendency  to  overfit  the  model  to  the  noisy  data,  leading 
to  larger  prediction  errors.  If  the  data  length  is  too  long,  then  the 
effects  of  parameter  variations  will  begin  to  dominate  the  prediction 
errors. 

In  order  to  generate  an  appropriate  measure  by  which  to  trade  off  thes 
two  characteristics,  we  again  use  the  AIC  criterion,  but  in  a  different 
way.  Assume  we  have  data  over  a  time  interval  I  =  (1,2,  ...,  n}  and  sup¬ 
pose  we  divide  this  interval  into  subintervals  of  length  W: 

I[  =  U,2,  ...  ,  W} 

1 2  =  {W+l ,  W+2 ,  ....  2W}  , 
etc. 

Then  an  appropriate  measure  for  data  length  determination  is  the  average 
per  sample  entropy.  In  terms  of  the  AIC  criterion  we  define 

^AICp(t{) 

where  AICp(k^*)  is  the  minimum  prediction  AIC  for  the  i ^  interval  If,  k^* 
is  the  optimal  model  order  for  the  i^  interval,  and  is  the  number  of 
intervals  of  W  over  the  whole  data  interval  I.  The  prediction  AIC  uses  th 
forward  prediction  error  variance  (cf  eq.  (5.9))  rather  than  the  fit  error 


variance,  since  we  are  interested  in  the  error  over  the  next  interval,  not 
the  one  over  which  the  model  was  fitted.  This  has  the  effect  of  increasing 
the  penalty  on  the  number  of  parameters  in  the  AIC  criterion. 

The  form  of  AICp  is 

AlCpCki*)  =  AlCCkj*)  +  M(ki*) 

Experimental  Results 

In  order  to  test  this  criterion  as  the  basis  for  data  length  selection, 
we  considered  a  second-order  AR  model 

y(t)  =  ai(t)  y(t-l)  +  a2(t)  y(t-2)  +  n(t) 
where  n(t)  was  zero-mean  white  gaussian  noise  with  unit  variance.  The  time 
varying  coefficients  were  selected  so  that  the  two  system  poles  were  on  the 
unit  circle  in  the  z-plane.  This  yields:  ai(t)  =  2  cos  0(t),  a2  =  -1  where 
the  roots  are:  cos  0(t)  +  i  sin  0(t)  and  cos  0(t)  -  i  sin  0(t).  The  time- 
variation  of  0(t)  was  selected  as  0(t)  =  0(0)  +  2  n  f  t,  where  f  is  a 
selected  frequency.  Two  values  of  f  (.0001,  .001)  were  used  in  the  experi¬ 
ments.  Total  data  length  was  1000  time  points.  The  results  for  Case  1 
(f  =  0.001)  are  shown  in  Table  7.1,  using  0(0)  =  0.2.  The  result  is  that 
the  optimal  indicated  data  length  is  10-12  samples  and  corresponds  to  the 
case  in  which  the  average  coefficient  change  over  the  fit  window  is  in  the 
range  of  0.80  -  0.96.  Over  the  entire  data  length  of  1000  samples,  the 
value  of  a\  starts  at  1.64,  deceases  to  -  1.90  at  t  =  400  and  then 
increases  to  1.90  at  t  =  1000.  Thus,  the  average  coefficient  change  over 
the  optimum  data  length  is  generally  more  than  40%  of  the  coefficient 


Table  2  shows  the  results  for  Case  2  (f  ■  0.0001)  in  which  the  optimal 
data  length  is  found  to  be  30  samples.  This  is,  of  course,  increased  over 
that  of  Case  1  since  the  coefficients  vary  much  less  rapidly  -  on  the  order 
of  0.019,  on  the  average.  Note  that  the  rms  prediction  error  oe  evaluated 
over  the  fit  set  generally  decreases  monotonically  with  data  length  and 
cannot  be  used  as  a  selection  criterion. 
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n 

to 


W  =  data  length  of  window 


AlCy  =  average  AIC 

AA1  *  average  change  of  A1  over  window 
ae  =  rms  one-step-ahead  prediction  error  over  fit  window 
°al  =  rms  error  in  al  estimate 
aa2  =  rms  error  in  a2  estimate 
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Table  7.2 

Data  Length  Estimation  for  Case  2  (f  *  0.0001) 


1.959 

2.131 

2.152 

2.153 
2.052 
1.835 


AAl 

°e 

°al 

°a2 

0.061 

0.519 

0.016 

0.018 

0.031 

0.486 

0.032 

0.031 

0.025 

0.464 

0.055 

0.049 

0.019 

0.437 

0.085 

0.068 

0.012 

0.462 

0.507 

0.046 

0.0061 

0.455 

0.065 

0.062 
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APPENDIX 


MAXIMUM  LIKELIHOOD  ESTIMATION 

We  prasent  here  some  basic  background  on  maximum  likelihood  estimation 
which  is  used  throughout  this  report. 

The  likelihood  function  for  a  sample  xj ,  X2 ,  .  •  •  xn  parametrized  by 
parameter  0  is 


L  =  p(xj  I  0) 


(A.  1  ) 


Assume  the  xj  are  drawn  independently  from  the  true  distribution 
p(x  |  0()).  Then  L  is  the  joint  distribution  function  of  xj  ,  X2 ,  •  •  .  ,  xn 
and 


/  *  *  J  L  dx )  ■  ■  *  dXj-j  ~  I 

Differentiating  wrt  0: 


(A. 2) 
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(A. 3) 


Differentiating  again 
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(A. 4) 


Now  define  the  covariance  matrix 


C  *  E 


/  3l°g  L\  I  /  3  log  l\  | 
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(A. 5) 


and  factor  C  as  C  *  W  W  . 


Write  (A. 4)  as 


3  log  L 
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W~T  =  (0O  -  0)T  l2i-°-g.  .L  W*T 
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The  right  hand  side  is  approximated  as 


A  rp  — T™  A  rp 

(0o  -  0)A  C  W  -  (0O  -  0)1  W 


The  left  hand  side  is  a  normalized  gaussian  variate  since 
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Thus,  the  right  hand  side  is  also  a  normalized  gaussian  variate  and 
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E  [ ( 0q  -  0)  (0O  -  0)T]  =  C  1  (A. 6) 

C  is  the  Fisher  information  matrix,  which  is  the  inverse  of  the  covariance 
matrix  of  the  parameter  estimation  errors. 


